function. For this purpose, one can perform a bioinformatic domain annotation, i.e. determine which binding domains and functional sites are present; these provide information about binding factors, but also about the regulation and function of proteins. Databases such as SMART, ProDom and Pfam provide information on proteins and domains and can also be used to search a protein sequence for known domains. Other important tools are the BLAST algorithm, the conserved domain search server (CDD) and the ELM server, which allow the analysis and prediction of domains in unknown sequences.
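To make this concrete, the following is a minimal sketch, not taken from the tools above, of how such a similarity search could be scripted: it assumes a hypothetical protein fragment and uses Biopython's interface to the NCBI BLAST service; a domain scan against Pfam or SMART would instead go through the respective web servers or HMMER.

```python
# Minimal sketch: remote BLASTP search for a (hypothetical) protein fragment
# via Biopython's NCBI interface. Domain scans against Pfam or SMART would
# instead use the respective web servers or HMMER's hmmscan.
from Bio.Blast import NCBIWWW, NCBIXML

# Placeholder query sequence (one-letter amino acid code); replace with a
# real protein sequence of interest.
protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQV"

# Submit the query to NCBI BLAST (blastp against the nr protein database)
result_handle = NCBIWWW.qblast("blastp", "nr", protein_seq)

# Parse the XML result; the best hits often point to already annotated
# homologues and thus to conserved domains and likely functions.
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E-value: {best_hsp.expect:.2e}")
```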
Information on the metabolome (the totality of metabolites) can be obtained using mass spectrometry or gas chromatography. Metabolome profiling is of interest, for example, to see how metabolites change after a pathogenic infection or drug treatment, or how the metabolism of humans and that of a pathogen differ. This is important, for instance, if a potential pharmaceutical is to specifically affect the metabolism of a bacterium without producing a toxic effect in humans. Important databases on biochemical metabolism include the Roche Biochemical Pathways and KEGG. Software such as Metatool, YANA, YANAsquare or PLAS (Power Law Analysis and Simulation) is useful for investigating metabolism in more detail, e.g. which metabolic fluxes are present or what effect changes in metabolic pathways have.
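As a small illustration of the kind of calculation such tools automate, here is a minimal sketch under assumed toy data: for the stoichiometric matrix S of a made-up network with two internal metabolites and four reactions, all steady-state flux distributions v with S·v = 0 are obtained from the null space of S (elementary flux modes, as computed by Metatool or YANA, are particular minimal combinations of these).

```python
# Minimal sketch of steady-state flux analysis on an assumed toy network.
# Toy network: ->A (v1), A->B (v2), B-> (v3), A-> (v4); A and B are internal.
import numpy as np
from scipy.linalg import null_space

# Stoichiometric matrix S: rows = internal metabolites (A, B),
# columns = reactions (v1, v2, v3, v4)
S = np.array([
    [1, -1,  0, -1],   # metabolite A
    [0,  1, -1,  0],   # metabolite B
])

# At steady state S @ v = 0; the null space spans all admissible flux
# distributions (elementary flux modes are minimal, non-decomposable ones).
K = null_space(S)
print("Basis of the admissible flux space (columns):")
print(np.round(K, 3))

# Sanity check: each basis vector leaves the internal metabolite pools unchanged
assert np.allclose(S @ K, 0)
```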
The large amounts of data that we can generate with modern techniques obviously help to describe a biological system, such as the heart muscle, much better. On the other hand, it is clear that the crucial point is to understand the underlying principles, as just explained for main and side effects and as further illustrated by other central system building blocks in this chapter. One therefore has two possibilities for describing a complicated biological system:
First of all, knowledge-based research is used to elucidate the basic principles of the
biological system (for the myocardial cell in heart failure, see Figs. 5.1 and 5.2). Next, one
uses new data, preferably a great deal of it (nothing else is meant by “big data”), to substantiate or modify the insights and hypotheses gained.
As you can see, relying only on the sheer amount of data and on large data sets is more a sign of bias or inexperience. Without a clear hypothesis about the behavior of the system, it is much harder to read the right thing from the data, or better still, to verify it. Even worse, “hypothesis-free” research is mostly bad, even if its advocates claim that one is then unbiased towards the results, because it is very easy to fall prey to chance.
Let us illustrate this again with the gene expression dataset in heart failure. Assume that we have measured 20,000 mRNAs and now want to understand, without a clear hypothesis, which ones are increased in heart failure. Even if no objective differences exist between the two groups (say, with and without a drug), among 20,000 mRNAs we would purely by chance find about 1000 mRNAs that show a difference in expression between the two groups with a p-value <0.05. Bioinformaticians and statisticians, as well as experimenters experienced with large data sets, know this and therefore correct the statistics accordingly.
This is the correction for multiple testing, for example according to Bonferroni. In this correction for many comparisons, the significance threshold (e.g. 0.05) is divided by the number of tests (n). For the 20,000 mRNAs, one would therefore only accept differences with a p-value <0.0000025
(the Bonferroni-corrected threshold). This is a very strict correction, but it applies to any distribution of the data.
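To make these numbers tangible, here is a minimal simulation under purely assumed data: two groups of samples are drawn from the same distribution for 20,000 genes, so every “significant” gene at p < 0.05 arises by chance alone, and the Bonferroni threshold of 0.05/20,000 removes essentially all of them.

```python
# Minimal simulation: 20,000 mRNAs with NO true difference between two groups.
# Roughly 5 % (about 1000 genes) still come out "significant" at p < 0.05,
# which is exactly what the Bonferroni correction guards against.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_samples = 20_000, 10            # 10 samples per group (assumed)

group_a = rng.normal(size=(n_genes, n_samples))   # e.g. "no drug"
group_b = rng.normal(size=(n_genes, n_samples))   # e.g. "drug" (same distribution!)

# Gene-wise t-tests across the two groups
_, p_values = ttest_ind(group_a, group_b, axis=1)

alpha = 0.05
bonferroni_alpha = alpha / n_genes          # 0.05 / 20,000 = 0.0000025

print("Nominally significant (p < 0.05):", np.sum(p_values < alpha))
print("Significant after Bonferroni    :", np.sum(p_values < bonferroni_alpha))
```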